{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Excel files and CSV into Python Pandas\n", "\n", "Loading data into Python from standard Excel files and comma-separated value (CSV) data is fundamental for many data scientists and analysts. \n", "\n", "Excel and CSV files are among the most common data storage formats. Python provides various tools to read, manipulate, and analyze this data. In this process, Excel files can be read using libraries like `pandas`, `xlrd`, and `openpyxl`. In contrast, CSV files can be imported using the built-in `csv` module or the `pandas` library. \n", "\n", "Understanding how to load data from these file formats is essential for data analysis and machine learning tasks in Python. This article will explore different methods to read data from Excel and CSV files and analyze some everyday use cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximity
0-122.2337.8841.0880.0129.0322.0126.03252-08-01 00:00:00452600.0NEAR BAY
1-122.2237.8621.07099.01106.02401.01138.03014-08-01 00:00:00358500.0NEAR BAY
2-122.2437.8552.01467.0190.0496.0177.02574-07-01 00:00:00352100.0NEAR BAY
3-122.2537.8552.01274.0235.0558.0219.06431-05-01 00:00:00341300.0NEAR BAY
4-122.2537.8552.01627.0280.0565.0259.08462-03-01 00:00:00342200.0NEAR BAY
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value \\\n", "0 322.0 126.0 3252-08-01 00:00:00 452600.0 \n", "1 2401.0 1138.0 3014-08-01 00:00:00 358500.0 \n", "2 496.0 177.0 2574-07-01 00:00:00 352100.0 \n", "3 558.0 219.0 6431-05-01 00:00:00 341300.0 \n", "4 565.0 259.0 8462-03-01 00:00:00 342200.0 \n", "\n", " ocean_proximity \n", "0 NEAR BAY \n", "1 NEAR BAY \n", "2 NEAR BAY \n", "3 NEAR BAY \n", "4 NEAR BAY " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel(\"data/housing.xlsx\", engine=\"openpyxl\")\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximity
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"data/housing.csv\").head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
orderinv-ordselectionCollTitleCorrectedTitleOrigRecommendsclassficationRecommendsclass_WDclass_WD&CS
05411120.0Angular 2 versus React: There Will Be BloodIn ...Angular 2 versus React: There Will Be Blood2.5KTECH2500.00.00.0
1186351.0The mind-blowing AI announcement from Google t...The mind-blowing AI announcement from Google t...2.4KTECH2400.00.00.0
265030.0Things I Wish Someone Had Told Me When I Was L...Things I Wish Someone Had Told Me When I Was L...3.2KMOT3200.00.00.0
3945591.0How to encrypt your entire life in less than a...How to encrypt your entire life in less than a...7.4KSEC7400.00.00.0
44092440.0Being A Developer After 40In freeCodeCampView ...Being A Developer After 406KMOT6000.00.00.0
.................................
64763617NaNDecember 2014December 2014NaNdateNaNNaNNaN
6486449NaNNovember 2014November 2014NaNdateNaNNaNNaN
6496476NaNOctober 2014October 2014NaNdateNaNNaNNaN
6506494NaNNovember 2013November 2013NaNdateNaNNaNNaN
6516512NaNOctober 2013October 2013NaNdateNaNNaNNaN
\n", "

652 rows × 10 columns

\n", "
" ], "text/plain": [ " order inv-ord selection \\\n", "0 541 112 0.0 \n", "1 18 635 1.0 \n", "2 650 3 0.0 \n", "3 94 559 1.0 \n", "4 409 244 0.0 \n", ".. ... ... ... \n", "647 636 17 NaN \n", "648 644 9 NaN \n", "649 647 6 NaN \n", "650 649 4 NaN \n", "651 651 2 NaN \n", "\n", " CollTitle \\\n", "0 Angular 2 versus React: There Will Be BloodIn ... \n", "1 The mind-blowing AI announcement from Google t... \n", "2 Things I Wish Someone Had Told Me When I Was L... \n", "3 How to encrypt your entire life in less than a... \n", "4 Being A Developer After 40In freeCodeCampView ... \n", ".. ... \n", "647 December 2014 \n", "648 November 2014 \n", "649 October 2014 \n", "650 November 2013 \n", "651 October 2013 \n", "\n", " CorrectedTitle OrigRecommends \\\n", "0 Angular 2 versus React: There Will Be Blood 2.5K \n", "1 The mind-blowing AI announcement from Google t... 2.4K \n", "2 Things I Wish Someone Had Told Me When I Was L... 3.2K \n", "3 How to encrypt your entire life in less than a... 7.4K \n", "4 Being A Developer After 40 6K \n", ".. ... ... \n", "647 December 2014 NaN \n", "648 November 2014 NaN \n", "649 October 2014 NaN \n", "650 November 2013 NaN \n", "651 October 2013 NaN \n", "\n", " classfication Recommends class_WD class_WD&CS \n", "0 TECH 2500.0 0.0 0.0 \n", "1 TECH 2400.0 0.0 0.0 \n", "2 MOT 3200.0 0.0 0.0 \n", "3 SEC 7400.0 0.0 0.0 \n", "4 MOT 6000.0 0.0 0.0 \n", ".. ... ... ... ... \n", "647 date NaN NaN NaN \n", "648 date NaN NaN NaN \n", "649 date NaN NaN NaN \n", "650 date NaN NaN NaN \n", "651 date NaN NaN NaN \n", "\n", "[652 rows x 10 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"https://raw.githubusercontent.com/freeCodeCamp/open-data/master/medium-fCC-data/data/medium_titles%20-%20rawdata.tsv\", sep=\"\\t\", )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change the data loading to contain the correct data types and explore the keywords for reading CSV files and Excel files using `Shift + Tab`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv(\"https://raw.githubusercontent.com/freeCodeCamp/open-data/master/medium-fCC-data/data/medium_titles%20-%20rawdata.tsv\", \n", " sep=\"\\t\", )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n", "- [Chris Albon on Loading CSVs](https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/)\n", "- [Shane Lynn on `Read_CSV`](https://www.shanelynn.ie/python-pandas-read_csv-load-data-from-csv-files/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }